Cellcano: supervised cell type identification for single cell ATAC-seq data

Computational cell type identification is a fundamental step in single-cell omics data analysis. Supervised celltyping methods have gained increasing popularity for single-cell RNA-seq data because of their superior performance and the availability of high-quality reference datasets. Recent technological advances in profiling chromatin accessibility at single-cell resolution (scATAC-seq) have brought new insights into epigenetic heterogeneity. With the continuous accumulation of scATAC-seq datasets, a supervised celltyping method specifically designed for scATAC-seq is urgently needed. Here we develop Cellcano, a computational method based on a two-round supervised learning algorithm to identify cell types from scATAC-seq data. The method alleviates the distributional shift between reference and target data and improves prediction performance. After systematically benchmarking Cellcano on 50 well-designed celltyping tasks from various datasets, we show that Cellcano is accurate, robust, and computationally efficient. Cellcano is well-documented and freely available at https://marvinquiet.github.io/Cellcano/.


Supplementary Figures
Supplementary Figure 1
Supplementary Figure 12: A diagram of our data preprocessing and data analysis procedure.
(1) Download scATAC-seq raw data (fragment files or BAM files).
(2) Genome liftOver (hg19 for human PBMCs datasets; mm10 for mouse brain datasets).
(3) Generate genome-wide bins, peaks and gene scores by ArchR.
(4) Curate cell types in human PBMCs and mouse brain datasets.

Supplementary Note 1: Details on designing celltyping tasks
In total, we designed 50 celltyping tasks that use different individuals as reference and target, drawn from six datasets (four human PBMCs datasets and two mouse brain datasets). We designed the celltyping tasks to mimic the following prediction scenarios:
- Intra-dataset individual prediction: users have one confidently annotated scATAC-seq profile from one individual and want to use it to annotate all other individuals from the same study.
- Inter-dataset individual prediction: users have one confidently annotated scATAC-seq profile from one individual and want to use it to annotate other individuals from different studies. Because the mouse brain comprises several brain regions, some mouse brain celltyping tasks differ not only in subject but also in brain region. We count these tasks in this category.
- Inter-dataset prediction (combined reference): users have several well-annotated scATAC-seq datasets and wish to use a large collection of public datasets to increase the reference data size and improve the prediction result. This scenario is based on our previous research (ref. 5), where we found that combining individuals or datasets as reference could lead to better prediction results.
- Inter-dataset prediction (combined target): users have scATAC-seq data from multiple batches and want to determine their cell types in one run using a given reference.
We include one additional task design, Inter-dataset prediction (Ground truth), in which the FACS-sorted human PBMCs dataset serves as the target dataset. Since the labels of the FACS-sorted human PBMCs dataset can be considered ground truth, we use this category to better evaluate how Cellcano performs relative to all other methods. However, this category does not arise in real cell type prediction scenarios.
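The intra-dataset and inter-dataset individual scenarios can be illustrated with a minimal sketch that pairs reference and target individuals. The function name `enumerate_tasks` and the study and individual labels are hypothetical; the sketch lists every possible pair and does not reproduce the 50 tasks actually used, nor the combined-reference or combined-target designs.

```python
from itertools import product

def enumerate_tasks(datasets):
    """Pair every reference individual with every other target individual.

    `datasets` maps a (hypothetical) study name to its list of individuals.
    Each task is (scenario, (ref_study, ref_individual), (tgt_study, tgt_individual)).
    """
    tasks = []
    for (ref_ds, ref_inds), (tgt_ds, tgt_inds) in product(datasets.items(), repeat=2):
        for ref, tgt in product(ref_inds, tgt_inds):
            if ref_ds == tgt_ds and ref == tgt:
                continue  # an individual never annotates itself
            scenario = "intra-dataset" if ref_ds == tgt_ds else "inter-dataset"
            tasks.append((scenario, (ref_ds, ref), (tgt_ds, tgt)))
    return tasks

# Toy example with made-up labels.
toy = {"study_A": ["a1", "a2"], "study_B": ["b1"]}
for task in enumerate_tasks(toy):
    print(task)
```

In practice only a curated subset of such pairs would be kept, for example those where reference and target share enough annotated cell types.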

Supplementary Note 2: An introduction to different ArchR gene score models
The scripts to generate the gene score models are provided by ArchR (ref. 6) (https://github.com/GreenleafLab/ArchR_2020). In total, there are eight categories of gene score models including:
(1) Model-Promoter: This class of models counts the reads located in the promoter region, using different window sizes.
(2) Model-GeneBody: This class of models counts the reads located on the whole gene body, with a certain extension upstream or downstream.
(3) GeneModel-Constant: This class of models counts reads from 1 kb upstream of the transcription start site (TSS) and varying distances downstream of the TSS. The constant gene model assigns every read the same weight of 1.
(4) GeneModel-TSS-Exponential: This class of models extracts reads from 1 kb upstream and 100 kb downstream of the TSS. Gene boundaries are set so that reads assigned to one gene body do not overlap with other gene bodies. Then, an exponential decay function is used to weight the reads from each windowed tile based on its distance to the TSS. The exponential decay function is $\exp\left(-\frac{\mathrm{abs}(\mathrm{distance})}{\mathrm{window}}\right) + \exp(-1)$, with different window parameters.
(5) GeneModel-TSS-NoBoundary-Exponential: Same as (4) except that no gene boundaries are set.
(6) GeneModel-GB-Exponential: Same as (4) except that the distance in the exponential decay function is calculated relative to the gene body instead of the TSS. Gene boundaries are set in this class of models.
(7) GeneModel-GB-Exponential-Extend: Same as (6) except that the gene bodies are extended. The distance in the exponential decay function is calculated relative to the extended gene bodies.
The gene score model recommended by ArchR lies in category (7). It integrates the signals from the gene body with the TSS extended 5 kb in the upstream direction. Then, it weights the reads outside the gene body region and uses a window parameter of 10,000.
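As a minimal illustration of the weighting schemes, the sketch below evaluates the exponential decay weight exp(-abs(distance)/window) + exp(-1) and contrasts it with the constant model. The function names and the plain weighted sum over tiles are assumptions made for illustration; this is not ArchR's implementation, which applies further steps not described here.

```python
import math

def exp_decay_weight(distance, window):
    """Weight of a tile located `distance` bp from the reference point
    (the TSS or the gene body, depending on the model class)."""
    return math.exp(-abs(distance) / window) + math.exp(-1)

def exponential_gene_score(tile_counts, tile_distances, window=10_000):
    """Assumed aggregation: sum of tile read counts times their decay weights."""
    return sum(
        count * exp_decay_weight(dist, window)
        for count, dist in zip(tile_counts, tile_distances)
    )

def constant_gene_score(tile_counts):
    """Constant model: every read has weight 1, so the score is the read total."""
    return float(sum(tile_counts))

# Toy tiles: read counts and their distances (bp) to the reference point.
counts = [2, 1]
distances = [0, 10_000]
print(exponential_gene_score(counts, distances))
print(constant_gene_score(counts))
```

The additive exp(-1) term keeps the weight of distant tiles from falling to zero, while tiles at distance 0 receive the maximum weight 1 + exp(-1).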